Abstract

We study the accuracy of a Bayesian supervised method used to cluster individuals into
genetically homogeneous groups on the basis of dominant or codominant molecular markers.
We provide a formula relating an error criterion to the number of loci used and the number
of clusters. This formula is exact and holds for an arbitrary number of clusters and markers.
Our work suggests that dominant-marker studies can achieve an accuracy similar to that of
codominant-marker studies if the number of markers used in the former is about 1.7 times
larger than in the latter.

1 Background 

A common problem in population genetics consists in assigning an individual to one of K
populations on the basis of its genotype and information about the distribution of the various
alleles in the K populations. This question has received considerable attention in the population
genetics and molecular ecology literature [TJ EJ |3l 0] as it can provide important insight into
gene flow patterns and migration rates. It is for example widely used in epidemiology to detect
the origin of pathogens or of their hosts (see e.g. EE E] for examples) or in conservation biology
and population management to detect illegal translocation or poaching [5]. See [S] for a review of
related methods.

In statistical terms, assigning an individual to one of several known clusters is a supervised
clustering problem. It requires observing the genotype of the individual to be assigned and
those of some individuals in the various clusters. For diploid organisms (i.e. organisms harbouring
two copies of each chromosome), certain lab techniques allow one to retrieve the exact genotype
of each individual. In contrast, for some markers it is only possible to say whether a certain allele
A (hereafter referred to as the dominant allele) is present or not at a locus. In this case, one cannot
distinguish the heterozygous genotype Aa from the homozygous genotype AA for the dominant
allele. The former type of markers are said to be codominant while the latter are said to be
dominant. It is clear that the second genotyping method incurs a loss of information. The
consequence of this loss of information has been studied from an empirical point of view [10] but it
has never been studied on a theoretical basis. The choice of one type of marker for empirical
studies is therefore often motivated by practical considerations rather than by an objective
rationale [12]. The objective of the present article is to compare the accuracy achieved with
dominant and codominant markers when they are used to perform supervised clustering and
to derive some recommendations about the number of markers required to achieve a certain
accuracy. Dominant markers are essentially bi-allelic in the sense that they record the presence or
the absence of a certain allele. We are not concerned here with the relation between informativeness
and the level of polymorphism (cf. |13} E] for references on this aspect). We therefore focus on
bi-allelic dominant and codominant markers. Hence our study is representative of Amplified
Fragment Length Polymorphism (AFLP) and Single Nucleotide Polymorphism (SNP) markers,
which are among the most widely used markers in genetics.

2 Informativeness of dominant and co-dominant markers 
2.1 Cluster model 

We will consider here the case of diploid organisms at $L$ bi-allelic loci. We denote by
$z = (z_l)_{l=1,\dots,L}$ the genotype of an individual. We denote by $f_{kl}$ the frequency of allele A
in cluster $k$ at locus $l$. We assume that each cluster is at Hardy-Weinberg equilibrium (HWE) at
each locus. HWE is defined as the conditions under which the allele carried at a locus on one
chromosome is independent of the allele carried at the same locus on the homologous chromosome.
This situation is observed at neutral loci when individuals mate at random in a cluster. For
codominant markers, denoting by $z_l$ the number of copies of allele A carried by an individual at
locus $l$, this can be expressed as

$$p(z_l = 2 \mid f) = f_l^2 \tag{1}$$
$$p(z_l = 1 \mid f) = 2 f_l (1 - f_l) \tag{2}$$
$$p(z_l = 0 \mid f) = (1 - f_l)^2 \tag{3}$$

For dominant markers, $z_l$ is equal to 0 or 1 depending on whether a copy of allele A is present in
the genotype of the individual. Under HWE we have:

$$p(z_l = 1 \mid f) = f_l^2 + 2 f_l (1 - f_l) \tag{4}$$
$$p(z_l = 0 \mid f) = (1 - f_l)^2 \tag{5}$$

In addition to HWE, we also assume that the various loci are at linkage equilibrium (henceforth
HWLE), i.e. that the probability of a multilocus genotype is equal to the product of probabilities
of single-locus genotypes: $p(z_1, \dots, z_L) = \prod_l p(z_l)$. We assume that the individual to be
classified originates from one of the K clusters (no admixture).
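As a minimal sketch of the cluster model, the Python snippet below evaluates the single-locus genotype probabilities of equations (1)-(5) and multiplies them across loci under linkage equilibrium. The function names (`codominant_probs`, `dominant_probs`, `multilocus_prob`) are illustrative choices made here, not part of the original study.

```python
def codominant_probs(f):
    """Return [p(z=0), p(z=1), p(z=2)] under HWE for allele-A frequency f."""
    return [(1 - f) ** 2, 2 * f * (1 - f), f ** 2]


def dominant_probs(f):
    """Return [p(z=0), p(z=1)]: allele A absent or present, under HWE."""
    return [(1 - f) ** 2, f ** 2 + 2 * f * (1 - f)]


def multilocus_prob(z, freqs, dominant=False):
    """Probability of the multilocus genotype z given per-locus frequencies.

    Loci are assumed to be at linkage equilibrium, so the single-locus
    probabilities are multiplied.
    """
    single = dominant_probs if dominant else codominant_probs
    prob = 1.0
    for z_l, f_l in zip(z, freqs):
        prob *= single(f_l)[z_l]
    return prob


if __name__ == "__main__":
    print(codominant_probs(0.3))
    print(dominant_probs(0.3))
    print(multilocus_prob([2, 0, 1], [0.3, 0.5, 0.8]))
```

The dominant "presence" probability is the sum of the codominant probabilities of genotypes AA and Aa, which is exactly the information lost when a dominant marker is used.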

2.2 Sampling model 

We will measure the accuracy of a classification rule for a given type of markers by the probability
of correctly assigning an individual of unknown origin. We are interested in deriving results that
do not depend (i) on the particular origin $c$ of the individual to be classified, (ii) on the genotype
$z$ of this individual and (iii) on the allele frequencies $f$ in the various clusters. We will therefore
derive results that are conditional on $c$, $z$ and $f$ and then compute Bayesian averages under
suitable prior distributions. The mechanism assumed in the sequel is as follows:

1. The individual has origin in one of the K clusters. This origin is unknown and all origins
are equally likely. We therefore assume a uniform prior for $c$ on $\{1, \dots, K\}$.

2. In each cluster, for each locus the allele frequencies follow a Dirichlet(1,1) distribution,
independently across clusters and loci.

3. Conditionally on $c$ and $f$, the probability of the genotype of the individual is given by
equations (1)-(3) or (4)-(5), i.e. we assume that the individual has been sampled at random
among all individuals in its cluster of origin.
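The three steps above can be sketched as a small simulation. This is only an illustration under the stated assumptions: cluster labels are 0-based here, the Dirichlet(1,1) prior on two alleles is drawn as a uniform frequency on [0, 1], and the names `draw_genotype` and `simulate_individual` are hypothetical.

```python
import random


def draw_genotype(f, dominant, rng):
    """Draw a single-locus genotype under HWE for allele-A frequency f."""
    # Count copies of allele A on the two homologous chromosomes.
    copies = int(rng.random() < f) + int(rng.random() < f)
    # A dominant marker only records presence (1) or absence (0) of allele A.
    return min(copies, 1) if dominant else copies


def simulate_individual(K, L, dominant=False, rng=None):
    """Simulate steps 1-3: origin c, allele frequencies f, genotype z."""
    if rng is None:
        rng = random.Random()
    c = rng.randrange(K)  # step 1: uniform origin (0-based label)
    # Step 2: flat prior, independent across clusters and loci.
    freqs = [[rng.random() for _ in range(L)] for _ in range(K)]
    # Step 3: genotype sampled from the cluster of origin.
    z = [draw_genotype(freqs[c][l], dominant, rng) for l in range(L)]
    return c, freqs, z


if __name__ == "__main__":
    print(simulate_individual(K=3, L=5, dominant=True, rng=random.Random(1)))
```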

2.3 Accuracy of assignments under a maximum likelihood principle 

We consider an individual of unknown origin $c$ with known genotype $z$ with potential origin in
K clusters with known allele frequencies. Following a maximum likelihood principle, it is natural
to estimate $c$ as the cluster label for which the probability of observing this particular genotype
is maximal. Formally: $c^* = \operatorname{argmax}_k\, p(z \mid c = k, f_k)$. This assignment rule is
deterministic, but whether the individual is correctly assigned will depend on its genotype and on
cluster allele frequencies. Randomising these quantities and averaging over all possible values, we
can derive a generic formula for the probability of correct assignment $p_{MLA}$ as

$$p_{MLA} = \int_\varphi \sum_\zeta \max_\gamma\, p(c = \gamma, z = \zeta \mid f = \varphi)\, dp(\varphi) \tag{6}$$

See Section A of the appendix for details. This formula is of little practical use and deriving a
more explicit expression for arbitrary values of $K$ and $L$ seems to be out of reach. However, for
$K = 2$ and $L = 1$, under the assumptions that the individual has a priori equally likely ancestry
in each cluster and that each $f_k$ has a flat Dirichlet distribution with parameter (1, 1), we
get

$$p^{c}_{MLA}(K = 2, L = 1) = 17/24 \quad \text{for codominant markers} \tag{7}$$

and

$$p^{d}_{MLA}(K = 2, L = 1) = 16/24 \quad \text{for dominant markers.} \tag{8}$$

Because of the lack of practical usefulness of eq. (6), we now define an alternative rule
for assignment that is similar in spirit to maximum likelihood but also leads to more tractable
equations.
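The values 17/24 and 16/24 can be checked numerically. The sketch below is a Monte Carlo approximation of eq. (6), not the analytical derivation of the appendix: it draws flat allele frequencies, enumerates all genotypes and averages the quantity $K^{-1} \sum_\zeta \max_\gamma p(z = \zeta \mid c = \gamma, f)$. The function names, the number of draws and the seed are arbitrary choices, and the enumeration is only practical for small $L$.

```python
import itertools
import random


def genotype_probs(f, dominant):
    """Single-locus genotype probabilities under HWE, indexed by z."""
    if dominant:
        return [(1 - f) ** 2, 1 - (1 - f) ** 2]
    return [(1 - f) ** 2, 2 * f * (1 - f), f ** 2]


def mla_accuracy(K, L, dominant, n_draws=50000, seed=0):
    """Monte Carlo estimate of the maximum likelihood assignment accuracy."""
    rng = random.Random(seed)
    n_geno = 2 if dominant else 3
    total = 0.0
    for _ in range(n_draws):
        # probs[k][l][g]: probability of genotype g at locus l in cluster k.
        probs = [[genotype_probs(rng.random(), dominant) for _ in range(L)]
                 for _ in range(K)]
        acc = 0.0
        for z in itertools.product(range(n_geno), repeat=L):
            best = 0.0
            for k in range(K):
                lik = 1.0
                for l, g in enumerate(z):
                    lik *= probs[k][l][g]
                best = max(best, lik)
            acc += best
        total += acc / K  # uniform prior on the origin
    return total / n_draws


if __name__ == "__main__":
    print(mla_accuracy(2, 1, dominant=False), 17 / 24)
    print(mla_accuracy(2, 1, dominant=True), 16 / 24)
```

Averaging the conditional accuracy rather than simulating individual assignments keeps the Monte Carlo noise small, since only the allele frequencies are random.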

2.4 Accuracy of assignments under a stochastic rule 

Considering the collection of likelihood values $p(z \mid c = k, f_k)$ for $k = 1, \dots, K$, following [To], we
define a stochastic assignment (SA) rule by assigning the individual to a group at random with
probabilities proportional to $p(z \mid c = k, f_k)$. In words, an individual with genotype $z$ is randomly
assigned to cluster $k$ with a probability proportional to the probability of observing this genotype
in cluster $k$. The rationale behind this rule is that high values of $p(z \mid c = k, f_k)$ indicate strong
evidence of ancestry in group $k$ but do not guarantee against misassignments. To derive the
probability of correct assignment, we first consider that the allele frequencies are known, and
then account for the uncertainty about these frequencies by Bayesian integration. The use of
a Bayesian framework is motivated by the fact that (i) there is genuine uncertainty on allele
frequencies which cannot be overlooked, and (ii) under some fairly mild assumptions, allele
frequencies are known to be Dirichlet distributed (possibly with a degree of approximation, see
e.g. [HUE]). Refer to [E] for further discussion of the Bayesian paradigm in population genetics.
We now give our main results regarding this clustering rule.
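For concreteness, the stochastic assignment rule can be sketched as a draw from the normalised likelihoods. The helper below is a hypothetical illustration (0-based cluster indices, likelihood values supplied by the caller), not code from the study.

```python
import random


def stochastic_assign(likelihoods, rng):
    """Pick a cluster index with probability proportional to its likelihood."""
    total = sum(likelihoods)
    if total <= 0:
        raise ValueError("at least one likelihood must be positive")
    u = rng.random() * total
    cumulative = 0.0
    for k, lik in enumerate(likelihoods):
        cumulative += lik
        if u < cumulative:
            return k
    return len(likelihoods) - 1  # guard against floating-point round-off


if __name__ == "__main__":
    rng = random.Random(0)
    # Likelihoods of one genotype in two clusters: cluster 1 is three times
    # more likely, so it should be chosen about 75% of the time.
    draws = [stochastic_assign([0.1, 0.3], rng) for _ in range(10000)]
    print(sum(draws) / len(draws))
```

Unlike the maximum likelihood rule, this rule sometimes assigns to a cluster with a lower likelihood, which is what removes the maximisation from the accuracy computation.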

For bi-allelic loci, denoting by $p^{c}_{SA}$ the probability of correct assignment using codominant
markers, we have:

$$p^{c}_{SA} = \frac{1}{1 + (K-1)\,(5/8)^{L}} \tag{9}$$

For bi-allelic loci, the probability of correct assignment using dominant markers, denoted by
$p^{d}_{SA}$, is

$$p^{d}_{SA} = \frac{1}{1 + (K-1)\,(25/33)^{L}} \tag{10}$$
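To make the two closed-form results easier to explore, the snippet below evaluates $1/(1 + (K-1)(5/8)^L)$ for codominant markers and $1/(1 + (K-1)(25/33)^L)$ for dominant markers, the expressions obtained in Appendix B. Exact rational arithmetic is used only for convenience, and the function names are illustrative.

```python
from fractions import Fraction


def p_sa_codominant(K, L):
    """Stochastic-assignment accuracy for codominant markers."""
    return 1 / (1 + (K - 1) * Fraction(5, 8) ** L)


def p_sa_dominant(K, L):
    """Stochastic-assignment accuracy for dominant markers."""
    return 1 / (1 + (K - 1) * Fraction(25, 33) ** L)


if __name__ == "__main__":
    for L in (1, 5, 10, 20):
        print(L, float(p_sa_codominant(3, L)), float(p_sa_dominant(3, L)))
```

For $K = 2$ and $L = 1$ the two functions return 8/13 and 33/58 respectively.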

3 Implications 

Our investigations considered bi-allelic loci and are therefore representative of AFLP and SNP
markers, which are among the most widely used markers in genetics. In this context, for supervised
clustering, our main conclusions are that (i) codominant markers are more accurate than dominant
markers, (ii) the difference of accuracy decreases toward 0 as the number of markers $L$ increases,
(iii) $L_d$ dominant markers can achieve an accuracy even higher than that of $L_c$ codominant
markers as long as the numbers of loci used satisfy $L_d > \lambda L_c$ where $\lambda = \ln(5/8)/\ln(25/33) \approx 1.69$.
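The threshold follows from comparing the two accuracy formulas: dominant markers do at least as well as codominant ones when $(25/33)^{L_d} \le (5/8)^{L_c}$, that is when $L_d \ge \lambda L_c$. As a hedged illustration, the snippet below computes $\lambda$ and the smallest number of dominant loci matching a given number of codominant loci; `dominant_loci_needed` is a hypothetical helper name.

```python
import math

# Ratio of the two log decay rates appearing in the accuracy formulas.
LAMBDA = math.log(5 / 8) / math.log(25 / 33)


def dominant_loci_needed(n_codominant):
    """Smallest number of dominant loci matching n_codominant codominant loci."""
    return math.ceil(LAMBDA * n_codominant)


if __name__ == "__main__":
    print(round(LAMBDA, 3))
    for n in (10, 50, 100):
        print(n, dominant_loci_needed(n))
```

Because the condition does not involve $K$, the same ratio applies whatever the number of clusters.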

The figures reported have to be taken with a grain of salt as they may depend on some specific
aspects of the models considered. For example, the model considered here assumes independence
of allele frequencies across clusters. This assumption is relevant for populations displaying
low migration rates and a low amount of shared ancestry. When one of these assumptions is
violated, an alternative parametric model based on the Dirichlet distribution that accounts for
correlation of allele frequencies across populations is often used (see [16] and references therein).
It is expected that the accuracy obtained with both markers would be lower under this model.
Besides, the present study does not account for ascertainment bias [US [2U1 EH E2], an aspect that
might affect the results but is notoriously difficult to deal with. However, it is important to note
that the conditions considered in the present study were the same for dominant and codominant
markers, so that results should not be biased toward one type of marker. Our global result about
the relative informativeness of dominant and codominant markers contrasts with the common
belief that dominant markers are an expedient one would resort to only when codominant markers
are not available (see [12] for discussions).

A comparison of dominant and codominant markers for unsupervised clustering has been carried
out [23]. This study, based on simulations, suggests that the loss of accuracy incurred by dominant
markers in unsupervised clustering is much larger than for supervised clustering. This is
presumably explained by the fact that in the case of HWLE clusters, supervised clustering seeks to
optimise a criterion based on allele frequencies only. This contrasts with unsupervised clustering,
which seeks to optimise a criterion based on allele frequencies and HWLE. A theoretical
analysis of unsupervised clustering algorithms along the lines of the present study would be
valuable, but we anticipate that it would present more difficulties.

A Supervised clustering with a maximum likelihood principle

We consider the setting where the unknown ancestry $c$ of an individual with genotype $z$ is
estimated by $c^* = \operatorname{argmax}_c\, p(z \mid c, f)$. As this estimator is a deterministic function of $z$ we
denote it by $c^*_z$ for clarity in the sequel. Consider for now that the allele frequencies $f$ are known
to be equal to some $\varphi$. Under this setting, randomness comes from the sampling of $c$ and then from
the sampling of $z \mid (c, f)$. We are concerned with the event $\mathcal{E}$ defined as

$\mathcal{E}$ = {the individual is correctly assigned}.

Applying the total probability formula, we can write

$$p(\mathcal{E} \mid f = \varphi) = \sum_\gamma \sum_\zeta p(\mathcal{E}, c = \gamma, z = \zeta \mid f = \varphi) \tag{11}$$

In the sum over $\gamma$, only one term is not equal to 0, namely the term for $\gamma = c^*_\zeta$, hence



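$$p(\mathcal{E} \mid f = \varphi) = \sum_\zeta p(\mathcal{E}, c = c^*_\zeta, z = \zeta \mid f = \varphi) \tag{12}$$

$$= \sum_\zeta p(c = c^*_\zeta, z = \zeta \mid f = \varphi) \tag{13}$$

$$= \sum_\zeta p(c = c^*_\zeta \mid f = \varphi)\, p(z = \zeta \mid c = c^*_\zeta, f = \varphi) \tag{14}$$

$$= \sum_\zeta p(c = c^*_\zeta)\, p(z = \zeta \mid c = c^*_\zeta, f = \varphi) \tag{15}$$

Assuming that the individual has a priori equally likely ancestry in each cluster, i.e. assuming
a uniform distribution for the class variable $c$, we get

$$p(\mathcal{E} \mid f = \varphi) = K^{-1} \sum_\zeta p(z = \zeta \mid c = c^*_\zeta, f = \varphi) \tag{16}$$

By definition, $c^*_z$ satisfies $p(z \mid c = c^*_z, f) = \max_\gamma p(z \mid c = \gamma, f)$, hence

$$p(\mathcal{E} \mid f = \varphi) = K^{-1} \sum_\zeta \max_\gamma\, p(z = \zeta \mid c = \gamma, f = \varphi) \tag{17}$$

$$= \sum_\zeta \max_\gamma\, p(c = \gamma)\, p(z = \zeta \mid c = \gamma, f = \varphi) \tag{18}$$

$$= \sum_\zeta \max_\gamma\, p(c = \gamma, z = \zeta \mid f = \varphi) \tag{19}$$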

We seek an expression of the probability of correct assignment that does not depend on
particular values of allele frequencies. This can be obtained by integrating over allele frequencies,
namely

$$p(\mathcal{E}) = \int_\varphi p(\mathcal{E} \mid f = \varphi)\, dp(\varphi) \tag{20}$$

$$= \int_\varphi \sum_\zeta \max_\gamma\, p(c = \gamma, z = \zeta \mid f = \varphi)\, dp(\varphi) \tag{21}$$

Note that identity (21) holds for any number of clusters $K$, any number of loci $L$ and any type
of markers (dominant vs. codominant).

We now consider a two-cluster problem in the case where the genotype of an individual has
been recorded at a single bi-allelic locus. We denote by $f_1$ (resp. $f_2$) the frequency of allele A in
cluster 1 (resp. cluster 2).

A.1 Codominant markers:

There are only three genotypes: AA, Aa and aa. Denoting by $f_k$ the frequency of allele A in
cluster $k$ and conditionally on $f_k$, these three genotypes occur in cluster $k$ with probabilities
$f_k^2$, $2 f_k (1 - f_k)$ and $(1 - f_k)^2$, and equation (21) can be simplified as

$$p(\mathcal{E}) = \int_\varphi p(c) \left[ \max_\gamma f_\gamma^2 + 2 \max_\gamma f_\gamma (1 - f_\gamma) + \max_\gamma (1 - f_\gamma)^2 \right] dp(\varphi) \tag{22}$$



We need to derive the distribution of $\max_\gamma f_\gamma^2$ and of $\max_\gamma f_\gamma (1 - f_\gamma)$. Assuming a flat
Dirichlet distribution for $f_k$, elementary computations give:

$$p(\max_k f_k^2 < x) = x \tag{23}$$

i.e. $\max_k f_k^2$ follows a uniform distribution on [0, 1] so that

$$E(\max_k f_k^2) = 1/2 \tag{24}$$



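Besides, we also get

$$p(\max_k f_k (1 - f_k) < x) = \left(1 - \sqrt{1 - 4x}\right)^2 \tag{25}$$

and differentiating

$$\frac{d}{dx}\, p(\max_k f_k (1 - f_k) < x) = 4\, \frac{1 - \sqrt{1 - 4x}}{\sqrt{1 - 4x}} \tag{26}$$

Integrating by parts, we get

$$E(\max_k f_k (1 - f_k)) = \int_0^{1/4} 4x\, \frac{1 - \sqrt{1 - 4x}}{\sqrt{1 - 4x}}\, dx = 5/24 \tag{27}$$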



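Eventually

$$p(\mathcal{E}) = \frac{1}{2} \left( \frac{1}{2} + 2 \cdot \frac{5}{24} + \frac{1}{2} \right) = 17/24 \tag{28}$$

where the last term in the bracket uses the fact that, by symmetry, $\max_k (1 - f_k)^2$ has the same
distribution as $\max_k f_k^2$. This proves equation (7). □

A.2 Dominant markers: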



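For a single locus, there are two genotypes A and a. Conditionally on $f_k$, these two genotypes
are observed in cluster $k$ with probabilities $1 - f_k^2$ and $f_k^2$ (here $f_k$ plays the role of the
frequency of the recessive allele a, which leaves the result unchanged under a flat prior).
Equation (21) can be simplified here as

$$p(\mathcal{E}) = \int_\varphi p(c) \left[ \max_\gamma f_\gamma^2 + \max_\gamma (1 - f_\gamma^2) \right] dp(\varphi) \tag{29}$$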



We now need the density of $\max_k (1 - f_k^2)$:

$$p(\max_k (1 - f_k^2) < x) = \left(1 - \sqrt{1 - x}\right)^2 \tag{30}$$

$$\frac{d}{dx}\, p(\max_k (1 - f_k^2) < x) = \frac{1}{\sqrt{1 - x}} - 1 \tag{31}$$

and

$$E(\max_k (1 - f_k^2)) = \int_0^1 x \left( \frac{1}{\sqrt{1 - x}} - 1 \right) dx = 5/6 \tag{32}$$



Eventually we get

$$p(\mathcal{E}) = 16/24 \tag{33}$$

which proves equation (8). □
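The expectations used in this appendix can be cross-checked numerically. The sketch below relies on the identity $E[X] = \int (1 - F(x))\, dx$ for a non-negative variable with distribution function $F$, applied to the distribution functions of equations (23), (25) and (30), with a plain midpoint rule. The helper name `expected_max` and the grid size are arbitrary choices.

```python
import math


def expected_max(cdf, upper, n=200000):
    """E[X] as the integral of 1 - cdf(x) over [0, upper], midpoint rule."""
    h = upper / n
    return h * sum(1.0 - cdf((i + 0.5) * h) for i in range(n))


def cdf_max_square(x):
    """Equation (23): distribution function of max_k f_k^2."""
    return x


def cdf_max_het(x):
    """Equation (25): distribution function of max_k f_k (1 - f_k)."""
    return (1.0 - math.sqrt(1.0 - 4.0 * x)) ** 2


def cdf_max_dom(x):
    """Equation (30): distribution function of max_k (1 - f_k^2)."""
    return (1.0 - math.sqrt(1.0 - x)) ** 2


e_sq = expected_max(cdf_max_square, 1.0)   # close to 1/2
e_het = expected_max(cdf_max_het, 0.25)    # close to 5/24
e_dom = expected_max(cdf_max_dom, 1.0)     # close to 5/6

# Two clusters with a uniform prior on the origin: p(c) = 1/2.
p_codominant = 0.5 * (e_sq + 2.0 * e_het + e_sq)
p_dominant = 0.5 * (e_sq + e_dom)

if __name__ == "__main__":
    print(p_codominant, 17 / 24)
    print(p_dominant, 16 / 24)
```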

B Stochastic assignment rule 

The maximum likelihood assignment rule considered above is not tractable for arbitrary values
of $K$ and $L$ (cf. eq. (21)). In particular, a difficulty arises from the maximisation involved. We
consider here an assignment rule that does not involve maximisation. The unknown ancestry $c$ of
an individual with genotype $z$ is predicted by a random variable $c^*$ with values in $\{1, \dots, K\}$ and
such that $p(c^* = k \mid z, f) \propto p(z \mid c = k, f)$. As in the previous sections, we first consider that the
allele frequencies are known; however, we omit this dependence from the notation at the beginning
for clarity. We will account for the uncertainty about these frequencies later by Bayesian
integration. In this setting, the structure of conditional probability dependence can be represented
by a directed acyclic graph as in the left-hand side of Figure 1.

We are concerned with evaluating the probability of the event $\mathcal{E}$ defined as

$\mathcal{E}$ = {the individual is correctly assigned},

i.e. $\mathcal{E} = \{c = c^*\}$. We denote by $p_a$ (resp. $p_b$) probabilities under the two conditional
dependence structures of Figure 1. Some elementary computations show that $p(\mathcal{E})$ can be expressed
in terms of a probability in the model on the right-hand side of Figure 1, namely:

$$p_a(c = c^*) = p_b(c = c' \mid z = z') \tag{34}$$

Figure 1: Directed acyclic graph for our stochastic assignment rule (left) and for an alternative
scheme (right). All downward arrows represent the same conditional dependence given by our
likelihood model. The upward arrow represents the reverse probability dependence.

The right-hand side of this expression can be written as

$$p_b(c = c' \mid z = z') = p_b(c = c', z = z') / p_b(z = z') \tag{35}$$

It is more convenient to manipulate this expression than $p_a(c = c^*)$. We will use it to evaluate
$p_a(\mathcal{E})$.

B.1 Codominant markers:

We assume that the individual has a priori equally likely ancestry in each cluster. We slightly
change the notation, denoting by $z_l$ the count of allele A at locus $l$ for the individual to be assigned.
Then, making the dependence on $f$ explicit in the notation, we have

$$p_b(c = c', z = z' \mid f) = \sum_z \sum_k p_b^2(c = k, z \mid f) \tag{36}$$

$$= \sum_z \sum_k \frac{1}{K^2} \prod_l \left[ 2^{\delta_{z_l}^{1}}\, f_{kl}^{z_l} (1 - f_{kl})^{2 - z_l} \right]^2 \tag{37}$$

where $\delta_{z_l}^{1}$ denotes the Kronecker symbol that equals 1 if $z_l = 1$ and 0 otherwise.
Accounting for uncertainty about $f$ by integration, we get

$$p_b(c = c', z = z') = \int_f p_b(c = c', z = z' \mid f)\, df \tag{38}$$

$$= \int_f \sum_z \sum_k \frac{1}{K^2} \prod_l \left[ 2^{\delta_{z_l}^{1}}\, f_{kl}^{z_l} (1 - f_{kl})^{2 - z_l} \right]^2 df \tag{39}$$
Among the terms enumerated in the sum over $z$ above, let us consider a generic genotype $z$ with
exactly $h$ heterozygous loci. For a given cluster $k$, the term corresponding to
such a genotype $z$ in the sum above can be written

$$\frac{1}{K^2}\, 2^{2h} \left[ \int f^2 (1 - f)^2\, df \right]^{h} \left[ \int f^4\, df \right]^{L-h} \tag{40}$$

since, by symmetry of the flat prior, $\int (1 - f)^4\, df = \int f^4\, df$.



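Denoting by $C_L^h$ the binomial coefficient, there are $C_L^h\, 2^{L-h}$ such terms. Equation (39) becomes

$$p_b(c = c', z = z') = \sum_h \sum_k C_L^h\, 2^{L-h}\, \frac{1}{K^2}\, 2^{2h} \left[ \int f^2 (1 - f)^2\, df \right]^{h} \left[ \int f^4\, df \right]^{L-h} \tag{41}$$

Assuming a flat Dirichlet distribution for the allele frequencies, we get

$$p_b(c = c', z = z') = \frac{1}{K} \left( \frac{8}{15} \right)^{L} \tag{42}$$

We now need to evaluate $p_b(z = z')$, but since

$$p_b(z = z' \mid f) = \sum_z p_b^2(z \mid f) = \sum_z \Big( \sum_k p_b(k, z \mid f) \Big)^2 \tag{43}$$

we get

$$p_b(z = z') = \int_f \sum_z \Big( \sum_k p_b^2(k, z \mid f) + \sum_{k \neq k'} p_b(k, z \mid f)\, p_b(k', z \mid f) \Big)\, df \tag{44}$$

$$= p_b(c = c', z = z') + \sum_h C_L^h\, 2^{L-h} \sum_{k \neq k'} \frac{1}{K^2}\, 2^{2h} \left[ \int f (1 - f)\, df \right]^{2h} \left[ \int f^2\, df \right]^{2(L-h)} \tag{45}$$

$$= \frac{1}{K} \left( \frac{8}{15} \right)^{L} + \frac{K-1}{K} \sum_h C_L^h\, 2^{L-h} \left( \frac{1}{9} \right)^{h} \left( \frac{1}{9} \right)^{L-h} \tag{46}$$

$$= \frac{1}{K} \left( \frac{8}{15} \right)^{L} + \frac{K-1}{K}\, \frac{1}{3^{L}} \tag{47}$$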



Eventually,

$$p(\mathcal{E}) = \frac{1}{1 + (K-1) \left( \frac{5}{8} \right)^{L}} \tag{48}$$

which proves equation (9).



□ 



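B.2 Dominant markers:

We still have

$$p_b(c = c', z = z') = \int_f \sum_z \sum_k \frac{1}{K^2} \prod_l \left[ (1 - f_{kl}^2)^{z_l}\, (f_{kl}^2)^{1 - z_l} \right]^2 df \tag{49}$$

where the recessive genotype ($z_l = 0$) has probability $f_{kl}^2$ and the dominant phenotype
($z_l = 1$) has probability $1 - f_{kl}^2$.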



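For a generic genotype $z$ in the sum above, let us denote by $r$ the number of loci at which the
recessive genotype is observed, then

$$p_b(c = c', z = z') = \sum_r \sum_k C_L^r\, \frac{1}{K^2} \left[ \int f^4\, df \right]^{r} \left[ \int (1 - f^2)^2\, df \right]^{L-r} \tag{50}$$

$$= \frac{1}{K} \sum_r C_L^r \left( \frac{1}{5} \right)^{r} \left( \frac{8}{15} \right)^{L-r} \tag{51}$$

$$= \frac{1}{K} \left( \frac{11}{15} \right)^{L} \tag{52}$$

Moreover, by arguments similar to those used for codominant markers, we get

$$p_b(z = z') = \frac{1}{K} \left( \frac{11}{15} \right)^{L} + \frac{K-1}{K} \left( \frac{5}{9} \right)^{L} \tag{53}$$

And we get

$$p_b(c = c' \mid z = z') = \frac{\frac{1}{K} \left( \frac{11}{15} \right)^{L}}{\frac{1}{K} \left( \frac{11}{15} \right)^{L} + \frac{K-1}{K} \left( \frac{5}{9} \right)^{L}} \tag{54}$$

Eventually,

$$p(\mathcal{E}) = \frac{1}{1 + (K-1) \left( \frac{25}{33} \right)^{L}} \tag{55}$$

which proves equation (10).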



□ 
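The per-locus constants 8/15, 1/3, 11/15 and 5/9 used in this appendix are integrals of polynomials in $f$ under the flat prior, so they can be checked with exact rational arithmetic. The sketch below does only that: it verifies the moment computations and the resulting ratios 5/8 and 25/33, and it does not test identity (34) itself. The helper `beta_moment` is an illustrative name, and the dominant case follows the convention of this appendix in which the recessive genotype has probability $f^2$.

```python
from fractions import Fraction
from math import factorial


def beta_moment(a, b):
    """Integral of f^a (1 - f)^b over [0, 1], equal to a! b! / (a + b + 1)!."""
    return Fraction(factorial(a) * factorial(b), factorial(a + b + 1))


# Codominant markers: genotype probabilities f^2, 2f(1-f), (1-f)^2.
same_c = beta_moment(4, 0) + 4 * beta_moment(2, 2) + beta_moment(0, 4)
cross_c = (beta_moment(2, 0) ** 2
           + (2 * beta_moment(1, 1)) ** 2
           + beta_moment(0, 2) ** 2)

# Dominant markers: recessive genotype f^2, dominant phenotype 1 - f^2.
# E[(1 - f^2)^2] = 1 - 2 E[f^2] + E[f^4].
same_d = beta_moment(4, 0) + (1 - 2 * beta_moment(2, 0) + beta_moment(4, 0))
cross_d = beta_moment(2, 0) ** 2 + (1 - beta_moment(2, 0)) ** 2

if __name__ == "__main__":
    print(same_c, cross_c, cross_c / same_c)  # 8/15 1/3 5/8
    print(same_d, cross_d, cross_d / same_d)  # 11/15 5/9 25/33
```

The ratios of the cross-cluster to the same-cluster constants are the bases raised to the power $L$ in equations (48) and (55).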